System and Method for Equal Perceptual Relevance 
Packetization of Data for Multimedia Delivery 



Cross Reference to Related Applications 

This application claims the benefit under 35 USC 119(e) of United States
provisional application 60/334,521, which was filed on November 30, 2001. The
application also relates to the co-pending patent application entitled "System and
Method for Encoding Three-Dimensional Signals Using A Matching Pursuit
Algorithm", Serial No. , which claims the benefit under 35 USC 119(e) of
United States provisional application 60/334,521, filed November 30, 2001, as
well as the co-pending patent application entitled "Transcoding Proxy and Method
for Transcoding Encoded Streams", Serial No. , which claims the benefit
under 35 USC 119(e) of United States provisional application 60/334,514, filed
November 30, 2001.

Field of the Invention 

This invention relates generally to digital signal representation, and more
particularly to an apparatus and method to improve the delivery quality of a digital
multimedia stream over a lossy packet network. The invention has particular
application with regard to the real-time streaming of compressed audiovisual
content over heterogeneous networks.

Background of the Invention



The purpose of source coding (or compression) is data rate reduction. For
example, the data rate of an uncompressed NTSC (National Television Systems 
Committee) TV-resolution video stream is close to 170 Mbps, which corresponds to 
less than 30 seconds of recording time on a regular compact disk (CD). The choice of 
a compression standard depends primarily on the available transmission or storage 
capacity as well as the features required by the application. The most often cited 
video standards are H.263, H.261, MPEG-1 and MPEG-2 (Moving Picture Experts 
Group). The aforementioned video compression standards are based on the techniques 
of discrete cosine transform (DCT) and motion prediction, even though each standard 
targets a different application (i.e., different encoding rates and qualities). The 
applications range from desktop video-conferencing to TV channel broadcasts over 
satellite, cable, and other broadcast channels. The former typically uses H.261 or 
H.263 while MPEG-2 is the most appropriate compression standard for the video 
broadcast applications.

Motion prediction operates to efficiently reduce the temporal redundancy inherent
to most video signals. The resulting predictive structure of the signal, however, makes 
it vulnerable to data loss when delivered over an error-prone network. Indeed, when 
data loss occurs in a reference picture, the lost video areas will affect the predicted 
video areas in subsequent frame(s), in an effect known as temporal propagation. 

Tri-dimensional (3-D) transforms offer an alternative to motion prediction. In this
case, temporal redundancy is reduced the way spatial redundancy is; that is, using a
mathematical transform for the third dimension (e.g., wavelets, DCT). Algorithms
based on 3-D transforms have proven to be as efficient as coding standards such as
MPEG-2, and comparable in coding efficiency to H.263. In addition, error resilience
is improved since compressed 3-D blocks are self-decodable.

Non-orthogonal transforms present several properties that provide an interesting
alternative to orthogonal transforms like DCT or wavelet. Decomposing a signal over 
a redundant dictionary improves the compression efficiency, especially at low bit rates 
where most of the signal energy is captured by few elements. Moreover, video signals 
issued from decomposition over a redundant dictionary are more resistant to data loss. 
The main limitation of non-orthogonal transforms is encoding complexity. 

Matching pursuit algorithms provide a way to iteratively decompose a signal into
its most important features with limited complexity. The matching pursuit algorithm
will output a stream composed of both atom parameters and their respective
coefficients. The problem with the state-of-the-art in matching pursuit is that the 
dictionaries do not address the need for decomposition along both the spatial and 
temporal domains, and also the optimization of source coding quality versus decoding 
complexity for a given bit rate. 

The art in Matching Pursuit (MP) coding is limited. A publication by S. G. Mallat
and Z. Zhang, entitled "Matching Pursuits With Time-Frequency Dictionaries",
Transactions on Signal Processing, Vol. 41, No. 12, December 1993, details one
application of matching pursuit coding. In addition, the publication entitled "Very
Low Bit-Rate Video Coding Based on Matching Pursuits", by R. Neff and A. Zakhor,
Circuits and Systems for Video Technology, Vol. 7, No. 1, February 1997, the
publication entitled "Decoder Complexity and Performance Comparison of Matching
Pursuit and DCT-Based MPEG-4 Video Codecs", by R. Neff, T. Nomura and A.
Zakhor, Circuits and Systems for Video Technology, Vol. 7, No. 1, February 1997,
and U.S. Patent No. 5,699,121 detail using a 2-D matching pursuit coder to compress
the residual prediction error resulting from motion prediction.

The shortcomings of the prior art include, first, that matching pursuit has never
been proposed for coding 3-D signals. Second, the basis functions have been limited
to Gabor functions because they were proven to reach the lower bound of the
uncertainty principle. However, these functions are generally isotropic (same scale
along x- and y-axes) and do not address image characteristics such as contours and
textures. The above-referenced co-pending patent application discloses a 3-D
encoding system and method.

Transmitting multimedia in digital form is the direct result of the benefits offered
by digital compression. The purpose of compression is data rate reduction, which
results in lower transmission costs. However, the distortion that the end-user perceives
results from compression artifacts, packet losses, delays, and delay jitter. All lossy
multimedia compression schemes distort and delay the signal. Degradation mainly
comes from the quantization, which is the only irreversible process in a coding
scheme. Moreover, delays and packet losses are inevitable during transfers across
today's networks. The delay is generally caused by propagation and queuing.
Information loss is mainly caused by multiplexing overloads of high magnitude and
duration, which lead to buffer overflow in the nodes. Data loss is particularly annoying
in video streaming applications due to the predictive structure of the compression
techniques, such that loss of packets creates perceptible video interruption for an
end-user/viewer.

Interactive multimedia delivery can be significantly improved by providing
sender-side, in-network mechanisms. These include (i) structuring techniques and
scalable coding to reduce data loss sensitivity, and (ii) forward error correction (FEC)
mechanisms to lower the probability of loss at the application layer. On the sending
end, redundancy is added to the data so that the receiver can recover from losses or
errors without any further intervention from the sender. FEC techniques also often
take advantage of the underlying multimedia content leading to an equal error
protection scheme. The former results in a higher protection while being
computationally heavy. The latter, while being less efficient, can easily be
implemented within the network, in so-called gateways.

Most multimedia delivery schemes produce packets of highly
different value. For example, the loss of a packet containing a portion of an MPEG I
frame has a much higher visual impact than the loss of a packet containing a portion of
an MPEG B frame (temporal propagation). However, any packet has the same
probability of being lost on best-effort networks.

What is needed, therefore, and what is an objective of the invention, is a
system and method for creating data packets of equivalent perceptual value to the
end-user and of as equal length as possible, whereby packet loss induces the same
perceptual degradation independently of its location in the multimedia stream.

Yet another objective of the invention is to provide a system and method which
facilitates easy error protection and stream thinning in multimedia gateways.

Summary of the Invention



The foregoing and other objectives are realized by the present invention which
provides an apparatus and method for improving the delivery of a digital stream over
an error-prone packet network. The method comprises creating data packets of
equivalent perceptual relevance to the end-user and of as equal length as possible,
such that packet loss induces the same perceptual degradation independently of its
location in the multimedia stream. The method also permits easy error protection
in multimedia gateways. The preferred embodiment describes the method applied to a
multimedia compression scheme built around a matching pursuit algorithm, although
the method is applicable to any data streams, including 1-D, 2-D and 3-D encoded
streams.



Brief Description of the Drawings 



The advantages of the present invention will become readily apparent to those 
ordinarily skilled in the art after reviewing the following detailed description and 
accompanying drawings, wherein: 

FIG. 1 is a block diagram illustrating the overall architecture in which the present
invention takes place;

FIG. 2 illustrates the Signal Transform Block 100 from FIG. 1;

FIG. 3 is a flow graph illustrating the Matching Pursuit iterative algorithm of FIG. 2; 

FIG. 4 shows an example of a spatio-temporal dictionary function in accordance
with the present invention;

FIG. 5 shows an example of video signal reconstruction after 100 Matching Pursuit
iterations;

FIG. 6 shows an example of video signal reconstruction after 500 Matching Pursuit
iterations;

FIG. 7 is a block diagram illustrating the inventive packetization;

FIG. 8 illustrates a transmission packet which encapsulates Matching Pursuit
iterations, wherein each iteration 801 is composed of an atom index and its respective
coefficient, both computed by a Matching Pursuit encoder; and

FIG. 9 is a flow chart depicting the inventive packetization process.

Detailed Description of the Invention



The present invention is directed to packetization of streams to ensure packets of 
equal perceptual relevance. As noted above, the inventive system and method apply to 
1-D, 2-D and 3-D encoded streams. The preferred embodiment is directed to the 
delivery of 3-D encoded streams, and more particularly to signals encoded using a 3-D 
Matching Pursuit Algorithm, as covered by the above-referenced co-pending 
application. The 3-D encoding of the co-pending application will be detailed below 
for the sake of completeness.

The co-pending invention applies a Matching Pursuit algorithm to encode 3-D
signals and defines a separable 3-D structured dictionary. The resulting representation
of the input signal is highly resistant to data loss (non-orthogonal transforms). Also, it
improves the source coding quality versus decoding requirements for a given target bit
rate (anisotropy of the dictionary).

Matching Pursuit (MP) is an adaptive algorithm that iteratively decomposes a
function $f \in L^2(\mathbb{R})$ (e.g., image, video) over a possibly redundant dictionary of
functions called atoms (see Figure 3). Let $D = \{g_\gamma\}$ be such a dictionary with
$\|g_\gamma\| = 1$. $f$ is first decomposed into:

$$f = \langle g_{\gamma_0} | f \rangle \, g_{\gamma_0} + Rf,$$

where $\langle g_{\gamma_0} | f \rangle \, g_{\gamma_0}$ represents the projection of $f$ onto $g_{\gamma_0}$ and $Rf$ the residual
component. Since all elements in $D$ have a unit norm, $g_{\gamma_0}$ is orthogonal to $Rf$, and
this leads to:

$$\|f\|^2 = |\langle g_{\gamma_0} | f \rangle|^2 + \|Rf\|^2.$$

In order to minimize $\|Rf\|$ and thus optimize compression, one must choose $g_{\gamma_0}$ such
that the projection coefficient $|\langle g_{\gamma_0} | f \rangle|$ is at a maximum. The pursuit is carried
further by applying the same strategy to the residual component. After $N$ iterations,
one has the following decomposition for $f$:

$$f = \sum_{n=0}^{N-1} \langle g_{\gamma_n} | R^n f \rangle \, g_{\gamma_n} + R^N f,$$

with $R^0 f = f$. Similarly, the energy $\|f\|^2$ is decomposed into:

$$\|f\|^2 = \sum_{n=0}^{N-1} |\langle g_{\gamma_n} | R^n f \rangle|^2 + \|R^N f\|^2.$$
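
By way of illustration only, the greedy decomposition described above may be sketched as follows. The sketch assumes a finite-dimensional real signal and a small random dictionary of unit-norm atoms; the names `inner`, `normalize`, `matching_pursuit` and the toy data are hypothetical and are not taken from the co-pending application. It elects, at each iteration, the atom with the largest projection coefficient in magnitude and subtracts its contribution from the residual.

```python
import math
import random


def inner(u, v):
    """Inner product of two equal-length real vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(inner(v, v))
    return [a / norm for a in v]


def matching_pursuit(signal, dictionary, n_iterations):
    """Greedy decomposition of `signal` over unit-norm `dictionary` atoms.

    Returns (iterations, residual); each iteration is (atom_index, coefficient).
    """
    residual = list(signal)
    iterations = []
    for _ in range(n_iterations):
        # Elect the atom whose projection coefficient has the largest magnitude.
        coeffs = [inner(atom, residual) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(coeffs[i]))
        c = coeffs[best]
        # Remove the atom's contribution to obtain the next residual.
        residual = [r - c * a for r, a in zip(residual, dictionary[best])]
        iterations.append((best, c))
    return iterations, residual


random.seed(0)  # hypothetical toy data, seeded for reproducibility only
dim, n_atoms = 16, 64
dictionary = [normalize([random.gauss(0, 1) for _ in range(dim)])
              for _ in range(n_atoms)]
signal = [random.gauss(0, 1) for _ in range(dim)]

iterations, residual = matching_pursuit(signal, dictionary, 10)
captured = sum(c * c for _, c in iterations)
# Energy conservation: ||f||^2 = sum |c_n|^2 + ||R^N f||^2, up to rounding.
print(inner(signal, signal), captured + inner(residual, residual))
```

The two printed values agree up to floating-point rounding, which reflects the energy decomposition: because every atom has unit norm, each iteration moves exactly the squared coefficient from the residual into the captured energy.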

Although matching pursuit places very few restrictions on the dictionary set, the
structure of the latter is strongly related to convergence speed and thus to coding
efficiency. The decay of the residual energy has indeed been shown to be
upper-bounded by an exponential, whose parameters depend on the dictionary.
However, true optimization of the dictionary can be very difficult. Any collection of
arbitrarily sized and shaped functions can be used, as long as completeness is
respected.

The 3-D encoding method is useful in a variety of applications where it is desired to
produce a low to medium bit rate video stream to be delivered over an error-prone
network and decoded by a set of heterogeneous devices. First, let the dictionary
define the set of basis functions used for the signal representation. The basis functions
are called atoms. The atoms are represented by a possibly multi-dimensional index
$\gamma_n$, and the index along with a correlation coefficient $c_{\gamma_n}$ forms an MP iteration.

As illustrated in FIG. 2, the original video signal $f$ is first passed to a Frame
Buffer 101 to form groups of $K$ video frames of dimension $X \times Y$. The method thus
decomposes the input video sequence into independent 3-D blocks that are $K$ frames
long. The dictionary 102 is composed of atoms, which are also 3-D functions of the
same size, i.e., $X \times Y \times K$. The method as shown in FIG. 3 iteratively compares the
residual 3-D function with the dictionary atoms and elects in the Pattern Matcher 103
the 3-D atom that best matches the residual signal (i.e., the atom which best correlates
with the residual signal). The parameters of the elected atom, which are the index
$\gamma_n$ and the coefficient $c_{\gamma_n}$, are sent across to the following block performing the
Coding (i.e., quantization, entropy coding possibly followed by channel coding, as
shown in FIG. 1). The pursuit continues up to a predefined number of iterations $N$,
which is either imposed by the user, or deduced from a rate constraint and/or a source
coding quality constraint.

The method relies on a structured 3-D dictionary 102, which allows for a good
trade-off between dictionary size and compression efficiency. In our method, the
dictionary is constructed from separable temporal and spatial functions, since features
to capture are different in spatial and temporal domains. An atom of the dictionary is
therefore written as $g_\gamma(x,y,k) = \Gamma \cdot S_{\gamma_s}(x,y) \times T_{\gamma_t}(k)$, where $\gamma$ corresponds to the
parameters that transform the generating function. The parameter $\Gamma$ is chosen so that
each atom is normalized, i.e., $\|g_\gamma(x,y,k)\|^2 = 1$. Each entry of the dictionary therefore
consists of a series of 7 parameters. The first 5 parameters specify position, dilation and
rotation of the spatial function of the atom, $S_{\gamma_s}(x,y)$. The last 2 parameters specify the
position and dilation of the temporal part of the atom, $T_{\gamma_t}(k)$.

The spatial function in the method is generated using B-splines, which present the
advantages of having a limited and calculable support and of optimizing the trade-off
between compression efficiency (i.e., source coding quality for a given target bit rate)
and decoding requirements (i.e., CPU and memory requirements to decode the input
bit stream). A B-spline of order $n$ is given by:

$$\beta^n(x) = \frac{1}{n!} \sum_{k=0}^{n+1} \binom{n+1}{k} (-1)^k \left[ x - k + \frac{n+1}{2} \right]_+^n,$$

where $[y]_+^n$ represents the positive part of $y^n$.
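
As a minimal sketch, a B-spline of order n may be evaluated numerically with the standard closed-form sum over one-sided power functions, on the assumption that this is the expression intended above. The function name `bspline` is hypothetical, and the positive part is taken to mean that the term equals y to the power n when y is positive and zero otherwise.

```python
from math import comb, factorial


def bspline(n, x):
    """Centered B-spline of order n evaluated at x.

    Each term uses the one-sided power [y]_+^n: y**n for y > 0, else 0.
    """
    total = 0.0
    for k in range(n + 2):
        y = x - k + (n + 1) / 2
        if y > 0:
            total += comb(n + 1, k) * (-1) ** k * y ** n
    return total / factorial(n)


# The cubic B-spline peaks at 2/3 and vanishes outside [-2, 2].
print(bspline(3, 0.0), bspline(3, 1.0), bspline(3, 2.5))
```

The support of the order-n spline is limited to an interval of width n + 1 centered at zero, which is the limited and calculable support referred to above.
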
The 2-D B-spline is formed with a 3rd order B-spline in one direction, and its first
derivative in the orthogonal direction to catch edges and contours. Rotation,
translation and anisotropic dilation of the B-spline generate an overcomplete
dictionary. The anisotropic refinement permits the use of different dilations along the
orthogonal axes, in contrast to Gabor atoms. Our spatial dictionary optimizes the
trade-off between coding quality and decoding complexity for a specified source rate.
The spatial function of the 3-D atoms can be written as $S_{\gamma_s} = S^x \times S^y$, with:

$$S^x = \beta^3\!\left(\frac{\cos(\varphi)(x - p_x) + \sin(\varphi)(y - p_y)}{d_x}\right),$$

$$S^y = \beta^2\!\left(\frac{\sin(\varphi)(x - p_x) - \cos(\varphi)(y - p_y)}{d_y} + \frac{1}{2}\right) - \beta^2\!\left(\frac{\sin(\varphi)(x - p_x) - \cos(\varphi)(y - p_y)}{d_y} - \frac{1}{2}\right).$$



The index $\gamma_s$ is thus given by 5 parameters; these are two parameters to describe an
atom's spatial position $(p_x, p_y)$, two parameters to describe the spatial dilation of the
atom $(d_x, d_y)$, and the rotation parameter $\varphi$.

The temporal function is designed to efficiently capture the redundancy between
adjacent video frames. Therefore $T_{\gamma_t}(k)$ is a simple rectangular function written as:

$$T_{\gamma_t}(k) = \begin{cases} 1 & \text{if } p_k \le k < p_k + d_k \\ 0 & \text{otherwise} \end{cases}$$

The temporal index $\gamma_t$ is here given by 2 parameters; these are one parameter to
describe the atom's temporal position $p_k$ and one parameter to describe the temporal
dilation $d_k$.
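
The seven-parameter index and the separable structure of an atom may be sketched as below. This is an illustration under stated assumptions: the class `AtomIndex`, the functions `temporal` and `atom_value`, and the placeholder `flat_spatial` are hypothetical names; the rectangular window is assumed to start at the temporal position and to span the temporal dilation; normalization of the atom is omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AtomIndex:
    """Seven-parameter atom index: five spatial and two temporal parameters."""
    p_x: float  # spatial position along x
    p_y: float  # spatial position along y
    d_x: float  # spatial dilation along x
    d_y: float  # spatial dilation along y
    phi: float  # rotation angle
    p_k: int    # temporal position (frame number)
    d_k: int    # temporal dilation (number of frames)


def temporal(index: AtomIndex, k: int) -> float:
    """Rectangular temporal function; the window [p_k, p_k + d_k) is assumed."""
    return 1.0 if index.p_k <= k < index.p_k + index.d_k else 0.0


def atom_value(index: AtomIndex,
               spatial: Callable[[AtomIndex, float, float], float],
               x: float, y: float, k: int) -> float:
    """Unnormalized separable atom: spatial part times temporal part."""
    return spatial(index, x, y) * temporal(index, k)


def flat_spatial(index: AtomIndex, x: float, y: float) -> float:
    """Placeholder spatial function, used only to exercise the temporal part."""
    return 1.0


index = AtomIndex(p_x=8.0, p_y=8.0, d_x=4.0, d_y=2.0, phi=0.0, p_k=2, d_k=3)
print([atom_value(index, flat_spatial, 8.0, 8.0, k) for k in range(6)])
# [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
```

Passing the spatial function as an argument keeps the sketch independent of the particular B-spline expression chosen for the spatial part.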

The range of the index parameters $(p_x, p_y, p_k, d_x, d_y, d_k, \varphi)$ is designed to cover the size
of the input signal. Spatio-temporal positions allow the 3-D input signal to be browsed
completely, and the dilation values follow an exponential distribution up to the 3-D
input signal size. The basis functions may however be trained on typical input signal
sets to determine a minimal dictionary, trading off the compression efficiency.
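
One possible reading of dilation values that follow an exponential distribution is a geometric progression bounded by the signal size. The sketch below makes that reading explicit; the function name `dilation_values`, the starting value of 1 and the base of 2 are assumptions chosen for illustration and are not specified in the text.

```python
def dilation_values(size, base=2.0):
    """Geometrically spaced dilations 1, base, base**2, ... not exceeding `size`."""
    if size < 1 or base <= 1:
        raise ValueError("size must be >= 1 and base must be > 1")
    values = []
    d = 1.0
    while d <= size:
        values.append(d)
        d *= base
    return values


# Hypothetical example: dilations for a signal dimension of 176 samples.
print(dilation_values(176))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
```

A geometric spacing keeps the number of dilation values logarithmic in the signal size, which limits the growth of the dictionary.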

FIG. 1 is a block diagram illustrating the overall architecture in which the 3-D 
encoding takes place. The Signal Transform block 100 is the focus of the co-pending 
application at which the foregoing transformation takes place. After transformation, 
the digital signal is quantized 200, entropy coded 300 and packetized 400 for delivery 
over the error-prone network 500. A wide range of decoding devices is targeted,
from a high-end PC 600 to PDAs 700 and wireless devices 800.

FIG. 2 illustrates the Signal Transform Block 100. The video sequence is fed into a
frame buffer 101, where a spatio-temporal signal is formed. This signal is
iteratively compared to functions of a Pattern Library 102 through a Pattern Matcher
103. The parameters of the chosen atoms are then sent to the quantization block 200,
and the corresponding features are subtracted from the input spatio-temporal signal.

FIG. 3 is a flow chart illustrating the Matching Pursuit iterative algorithm of FIG. 2.
The Residual signal 101, which consists of the input video signal at the beginning of
the Pursuit, is compared to a library of functions and the best matching atom is elected
by a Pattern Matcher 103. The contribution of the chosen atom is removed from the
residual signal 104 to form the residual signal of the next iteration.

The Pattern Matcher 303 basically comprises an iterative loop within the MP
algorithm main loop, as shown in FIG. 3. The residual signal is compared with the
functions of the dictionary by computing, pixel-wise, the correlation coefficient
between the residual signal and the atom. The square of the correlation coefficient
represents the energy of the atom (107). The atom with the highest energy (112) is
considered as the atom that best matches the residual signal characteristics and is
elected by the Pattern Matcher. The atom index and parameters are sent (118) to
the Entropy Coder as shown in FIG. 2, and the residual signal is updated
accordingly (104). To increase the speed of the encoding, the best atom search can
be performed only on a well-chosen subset of the dictionary functions. However, such
a method may result in a sub-optimal signal representation.

FIG. 4 shows an example of a spatio-temporal dictionary function for use with the
present invention. FIG. 5 shows an example of video signal reconstruction after 100
Matching Pursuit iterations. FIG. 6 shows an example of video signal reconstruction
after 500 Matching Pursuit iterations. Clearly the amount of signal information
recovered increases with successive iterations.

Given the output of the Matching Pursuit algorithm, the inventive packetization
method next provides a way to distribute the atoms of an audio, image or video
segment into a given number of packets. As noted above, the packetization method
can be applied to 1-dimensional, 2-dimensional, or 3-dimensional compressed signals.
The number of iterations is imposed by the compression algorithm and directly
impacts the coding rate and quality. It has been shown in the literature that the energy
iteratively captured by each atom is exponentially decreasing. This property is at the
heart of the proposed method.
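
To make the consequence of this property concrete, suppose, as an idealization, that the energy captured at iteration $n$ is exactly exponential with a decay parameter $\nu$. The symbols $e_n$, $e_0$ and $E(a,k)$ are introduced here for illustration only. The energy carried by a run of $k$ consecutive iterations starting at iteration $a$ is then a geometric sum:

```latex
e_n = e_0\,\nu^{n}, \quad 0 < \nu < 1,
\qquad
E(a,k) = \sum_{n=a}^{a+k-1} e_n
       = e_0\,\nu^{a}\,\frac{1-\nu^{k}}{1-\nu}.
```

Because the factor $\nu^{a}$ shrinks as $a$ grows, a run that starts later in the stream must contain more iterations to carry the same energy as an earlier run, which is why an increasing number of iterations is placed into successive packets.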

FIG. 7 is a block diagram illustrating the inventive packetization. The Matching
Pursuit iteration stream 700, where an iteration means an atom index along with the
respective correlation coefficient, is packetized into N equivalent energy packets 200.
The number of packets N is given by the negotiated transmission rate and packet size.
The number of iterations fed into each packet (i.e., the $k_i$ values) is given by a
recurrence formula presented below. Iterations are considered as basic entities and a
whole number of iterations is fed into each packet. The packetization process
terminates when all iterations have been encapsulated.

FIG. 8 illustrates a transmission packet which encapsulates Matching Pursuit
iterations. An iteration 801 is composed of an atom index and its respective
coefficient, both computed by a Matching Pursuit encoder. The packetization method
is applicable to any encoded stream obtained by transforming the original signal with
either a non-linear transform (e.g., matching pursuit) or a linear transform (e.g.,
Discrete Cosine Transform or wavelets) followed by a non-linear operation to ensure
the decreasing-energy ordering of the transform coefficients. The transform
coefficients include, in the special case of matching pursuit transform, the illustrated
correlation coefficients and the parameters of the set of atoms constituting the encoded
stream. The packetization method takes advantage of the fact that the energy of an
atom decreases exponentially with the iteration number. Therefore, by staggering the
packets into which successive atoms are placed, the relative energy of each packet can
be equalized.

The packetization method works as follows (see FIG. 9), assuming the number of
packets N per audio, image or video segment is given. The number of packets N is
generally computed once the length of the data segment (i.e., the number of iterations
used to code the signal $f$) and the average packet size (given by the transmission
settings) are known. The packetization basically copies the MP stream iterations into
packets in two very similar loops. Along each loop, an increasing number of iterations
is copied into each transmission packet, so that every packet contains the same energy.
In the first loop, the packets are taken in a forward order. The scanning order is
reversed in the second loop to balance the packet size.
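
The computation of the number of packets N from the segment length and the packet size may be sketched as follows. The sketch assumes, for simplicity, that every iteration occupies the same number of bits after coding, which need not hold when entropy coding is used; the function name `packet_count` and the figures in the example are hypothetical.

```python
def packet_count(n_iterations, bits_per_iteration, packet_size_bits):
    """Smallest number of packets of the given size that holds the segment."""
    if min(n_iterations, bits_per_iteration, packet_size_bits) <= 0:
        raise ValueError("all arguments must be positive")
    total_bits = n_iterations * bits_per_iteration
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bits // packet_size_bits)


# Hypothetical figures: 500 iterations of 40 bits each, 4000-bit packets.
print(packet_count(500, 40, 4000))  # 5
```

Rounding up ensures that a whole number of packets is available for the whole number of iterations to be encapsulated.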

At initialization 901, the packet number p is set to 1 and the index k is set to 1
($k_0 = 1$). An iteration represents the smallest independent entity in the packetization
process and comprises an atom and its respective coefficient (see FIG. 8). Next, the
values of $k_i$ are computed 902 according to the following recursive relation, where
$\nu$ is the decay parameter of the exponential mentioned above:

[Equation not recoverable from the source; the surviving fragment shows a denominator of $\log(\nu)$.]

The parameter $\nu$ only depends on the dictionary used in the Matching Pursuit and is
given as an input parameter to the packetization algorithm. The number of packets N
is given by the negotiated transmission rate and packet size. The $k_i$ values are
computed in such a way that the same energy is put into every packet, assuming an
exponential energy decay along the MP stream. The number of iterations 903 copied
into each packet at 904 is directly given by the $k_i$ parameters. The packet number p is
then incremented at 905, and the process is repeated as long as the packet number is
smaller than N as determined at 906. When the packetization process reaches the N-th
packet, it begins another loop 911, resetting p to 1 (912) but using the same $k_i$ values
907 as in the previous loop. The second loop, however, inverts the packet order in
908, whereby the next k iterations are copied into packet N-p. The packetization
thus proceeds in two loops, feeding packets in an alternating manner to balance the
packet sizes. The packet number is then incremented at 913 and the process repeats
the same loop while the packet number is smaller than N as determined at 914. When
the packet number is equal to N, the process switches to the first loop, resetting p to 1
(910). The packetization process terminates when all iterations have been
encapsulated, as determined at steps 909 and 915.
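
The alternating two-loop copy described above may be sketched as follows. The sketch takes the per-packet iteration counts as an input list rather than computing them, since the recursive relation is not reproduced here; the function name `packetize`, the zero-based packet numbering and the example counts are assumptions made for illustration. Forward passes visit the packets in increasing order and reverse passes in decreasing order, both using the same counts.

```python
def packetize(iterations, k_values):
    """Distribute iterations into len(k_values) packets in alternating passes.

    k_values[i] is the number of iterations copied into the i-th packet
    visited in a pass. Forward passes visit packets 0..N-1, reverse passes
    visit packets N-1..0, and passes alternate until the stream is exhausted.
    """
    n = len(k_values)
    if n == 0 or any(k <= 0 for k in k_values):
        raise ValueError("k_values must be a non-empty list of positive counts")
    packets = [[] for _ in range(n)]
    pos = 0
    forward = True
    while pos < len(iterations):
        order = range(n) if forward else range(n - 1, -1, -1)
        for k, target in zip(k_values, order):
            if pos >= len(iterations):
                break
            # Copy the next k iterations (fewer at the end of the stream).
            packets[target].extend(iterations[pos:pos + k])
            pos += k
        forward = not forward
    return packets


# Hypothetical example: 20 iterations labelled 0..19, three packets.
packets = packetize(list(range(20)), [1, 2, 3])
for packet in packets:
    print(packet)
```

With these example counts the first packet receives iteration 0 in the forward pass and iterations 9 to 11 in the reverse pass, so that a packet holding few high-energy iterations from one pass is topped up with a longer run from the next.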

Upon completion, the disclosed process will have encapsulated all iterations into 
data packets having the same energy and the same resulting visual significance. 
Consequently, as the packets are being streamed, the loss of any single packet will 
have minimal perceptible impact on the display being consumed by the end user. 

The invention has been detailed in terms of preferred embodiments such as
Matching Pursuit compression of 3D signals. One having skill in the art will
recognize that modifications may be made without departing from the spirit and scope
of the invention as set forth in the appended claims, such that DCT compression and
other operations yielding decreasing-energy ordering of transform coefficients for 1D,
2D or 3D signals can make use of the inventive packetization method.